@BeritJanssen
Collaborator

The current branch uses the parlamint_v4.py utils file from #1730. I changed it in the following ways:

  • added a function to read people and party metadata from the external files (included in the test data)
  • added various functions for named entity fields, and parsers to enrich with named entity annotations
  • added functions to flatten the annotated format of the speeches
  • removed functions that were copied without change from parliament.utils.parlamint.py utils

@BeritJanssen BeritJanssen requested a review from Meesch June 11, 2025 14:38
@BeritJanssen BeritJanssen changed the title from "Update Netherlands Recent corpus definition to read from Parlamint v4 data" to "Update Netherlands Recent corpus definition to read from Parlamint v4 data, including Named Entities" Jun 11, 2025
Contributor

@lukavdplas lukavdplas left a comment

The code looks fine! I added some requests for clarity, but other than that, feel free to merge.

Comment on lines +19 to +22
"""
This file was created as an updated utils file for the ParlaMint dataset, version 4.0. The previous utils file
is based on version 2.0.
"""
Contributor

Small note: module level docstrings should be the first line of the file. If you put it after the imports, it's not registered by help, code editors, etc.
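To illustrate the point, here is a minimal sketch using the standard-library `ast` module. The two source strings are made-up stand-ins for a utils file, not the actual contents of parlamint_v4.py. A string literal placed after the imports is just an expression statement, so it is not picked up as the module docstring:

```python
import ast

# Hypothetical module sources; the docstring text is a placeholder.
AFTER_IMPORTS = '''import os

"""Utilities for ParlaMint 4.0."""
'''

FIRST_STATEMENT = '''"""Utilities for ParlaMint 4.0."""

import os
'''


def module_docstring(source: str):
    """Return the module docstring of `source`, or None if there is none."""
    return ast.get_docstring(ast.parse(source))


doc_after = module_docstring(AFTER_IMPORTS)    # string comes after the import
doc_first = module_docstring(FIRST_STATEMENT)  # string is the first statement

print(repr(doc_after))
print(repr(doc_first))
```

Only the second variant yields a docstring; the same rule determines what ends up in a module's `__doc__`, which is what `help` reads.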

I think by "the previous utils file", you're referring to parliament.py? This reference is ambiguous. Also, someone not familiar with the code history would not know which file came first chronologically.

If I were looking over a directory with parliament.py and parliament_v4.py, I would probably assume that parliament.py covers the most recent version and should be the default for new corpora, and the v4 module is for compatibility with some older version.

    else:
        return False

def transform_current_party_id(data):
Contributor

This could do with a docstring. What is data? What is it transformed into?

Based on the name, I assumed this function transformed the current party ID into something else, but based on the code, it looks like this function retrieves the current party ID from other data? If so, the name is a bit misleading.

def annotated_text_mapping():
    return {'type': 'annotated_text'}
def ner_mapping():
    return {'type': 'text', 'index': False}
Contributor

What was the issue with the 'annotated_text' type?

Contributor

See #1724

TLDR: it wasn't necessary because we already have the unannotated text for full-text search, and keyword fields for finding entities.
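As a rough sketch of that division of labour, assuming Elasticsearch-style mappings: only `ner_mapping` is taken from the diff above. The helper names `text_mapping` and `keyword_mapping`, the field names, and `searchable_fields` are hypothetical and only serve the illustration.

```python
def ner_mapping():
    # From the diff: the annotated text is stored but not indexed.
    return {'type': 'text', 'index': False}


def text_mapping():
    # Hypothetical helper: plain text field, indexed for full-text search.
    return {'type': 'text'}


def keyword_mapping():
    # Hypothetical helper: exact-match field, e.g. for entity names.
    return {'type': 'keyword'}


# Hypothetical field names, not the ones used in the corpus definition.
mappings = {
    'properties': {
        'speech': text_mapping(),         # unannotated text, searchable
        'speech_ner': ner_mapping(),      # annotated text, display only
        'ner_person': keyword_mapping(),  # entities, filterable
    }
}


def searchable_fields(mappings):
    """List the fields that are indexed ('index' defaults to True)."""
    return sorted(
        name
        for name, mapping in mappings['properties'].items()
        if mapping.get('index', True)
    )


print(searchable_fields(mappings))
```

In this setup the annotated field never needs to be queried, so a dedicated `annotated_text` type adds nothing over a non-indexed `text` field.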

@BeritJanssen BeritJanssen merged commit d0e76da into develop Jul 14, 2025
1 check passed
@BeritJanssen BeritJanssen deleted the feature/parlamint-ner branch July 14, 2025 13:49